{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Lab 11 - Logistic Regression Continued\n",
    "\n",
    "The Akimel O'odham people, who were also known as the Pima Indians since European colonization of the US, currently have a high prevalence of diabetes.   The Pima Indian Diabetest dataset contains different possible diabetes indicators and whether the person has diabetes is on [Kaggle](ttps://www.kaggle.com/uciml/pima-indians-diabetes-database) or available [here](http://comet.lehman.cuny.edu/owen/teaching/mat328/diabetes.csv).\n",
    "\n",
    "Load the dataset into a dataframe called `diabetes`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import matplotlib.pyplot as plt\n",
    "import seaborn as sns\n",
    "import numpy as np\n",
    "from sklearn.model_selection import train_test_split\n",
    "from sklearn.linear_model import LogisticRegression\n",
    "import statsmodels.formula.api as smf\n",
    "from scipy.special import expit\n",
    "from scipy.stats import logistic\n",
    "\n",
    "%matplotlib inline\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Plot a scatter plot of glucose vs. diabetes."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Fit a logistic regression model to this data, using Glucose as the independent variable and Outcome as the dependent variable."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "What is the equation of the logistic regression model?\n",
    "\n",
    "We can also plot the logistic regression model using Seaborn's `regplot()`.  Use `regplot()` as if you were doing linear regression on the variables, but add in the parameter `logistic = True`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's assess this model by computing the confusion matrix:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "What type of errors are most likely?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Compute the sensitivity and specificity of our model:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Based on the sensitivity and specificity, what are the strong and weak points of the model?\n",
    "\n",
    "Remember that by default the confusion matrix uses 0.5 as the cut-off for whether a y value indicates 0  or 1.  We can change that by passing in the new cut-off as a parameter.  For example, to interpret values >= 0.7 as 1 and < 0.7 as 0, use the code:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "How did the confusion matrix change?  Do you think this new model is better or worse?  Recompute the sensitivity and specificity."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "How did the sensitivity and specificity change?  Are they better or worse?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Logistic Regression with Multiple Independent Variables\n",
    "\n",
    "Let's compute a logistic regression model using all of the columns as independent variables."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Do any of the variables have higher p-values?  If so, let's create a new logistic regression model without those columns."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "What's the equation for this logistic regression model?\n",
    "\n",
    "As before, let's compute the confusion matrix:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Do you think this is an improvement over the model based only on glucose?\n",
    "\n",
    "Compute the sensitivity and specificity:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "How do the sensitivity and specificity compare to the simpler model?  Play around with the cut-off to see what number gives the best results.\n",
    "\n",
    "### Challenges\n",
    "- To better understand the data, plot the distributions of the Glucose column for people with diabetes and without diabetes as overlapping histograms.  How does this graph compare to the scatterplot of glucose vs. outcome?  Which gives more information?\n",
    "- (Very challenging) A *Receiver Operating Characteristic curve* or *ROC curve* gives information about the trade-off between the true positive rate (sensitivity) and the false positive rate (1 - specificity) as the cut-off changes.  To plot such a curve, use a loop to compute the true positive rate and false positive rate for multiple cut-offs (eg. 0.1, 0.2, ..., 0.8, 0.9), and plot a line plot of these values with the false positive rate on the x axis and the true positive rate on the y axis."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}